The goal of this experiment is to check whether having causal information about the variables of our data set helps us build better models.
The experiment has the following steps:
1. Define a causal graph and generate a synthetic data set from it.
2. Train a standard linear regression on the variables visible to the practitioner.
3. Train a second linear regression whose features are chosen using the causal graph.
4. Compare both models when we intervene on one of the variables.
Here I define a causal graph whose relations may cause problems if not adequately treated. Nodes in gray will not be visible in the dataset.
import numpy as np
import pandas as pd
import pandas_profiling as pp
import plotly_express as px
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
import networkx as nx
from nxpd import draw
# np.random.seed(seed=42)  # with this seed the test set is not representative
np.random.seed(seed=22)
G = nx.DiGraph()
G.graph['dpi'] = 120
G.add_nodes_from(['X', ('A',{'color':'gray'}), 'B', ('C',{'color':'gray'}), 'Y'])
G.add_edges_from([('A','X'),('A','B'),('C','B'),('C','Y'),('X','Y')])
draw(G, show='ipynb')
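Note that nxpd requires Graphviz. If it is not available, a rough equivalent using networkx's built-in matplotlib drawing (an alternative sketch, not the rendering used here) would be:
import matplotlib.pyplot as plt

pos = nx.spring_layout(G, seed=1)
node_colors = ['lightgray' if G.nodes[n].get('color') == 'gray' else 'white'
               for n in G.nodes]
nx.draw_networkx(G, pos=pos, node_color=node_colors)
plt.axis('off')
plt.show()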
The problem presented to the ML practitioner will be:
Can you estimate the real influence of X on Y?
That is, what happens to Y when I increase X by one unit, keeping everything else equal?
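In causal notation, this asks for the interventional quantity E[Y | do(X = x + 1)] - E[Y | do(X = x)], which is not the same thing as the observational difference E[Y | X = x + 1] - E[Y | X = x].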
Let's define some rules to later generate a data set:
n = 10000

# A and C are independent root causes; both will be hidden from the practitioner
sigma_A = 2
mu_A = -2
A = sigma_A * np.random.randn(n,1) + mu_A
sigma_C = 8
mu_C = 3
C = sigma_C * np.random.randn(n,1) + mu_C

# B is a collider: a common effect of A and C
B = 5*A - 2*C + np.random.randn(n,1)/10
# X depends only on A
X = -3*A + np.random.randn(n,1)/10
# Y is caused by X (true effect = 1) and by C
Y = X + 2*C + np.random.randn(n,1)
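In equation form, reading directly off the code above, the structural causal model is:
A ~ Normal(mean=-2, sd=2)
C ~ Normal(mean=3, sd=8)
B = 5*A - 2*C + small noise
X = -3*A + small noise
Y = X + 2*C + noise
So the true causal effect of X on Y is exactly 1; that is the number we will try to recover.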
df_data = np.concatenate((A,B,C,X,Y), axis=1)
df = pd.DataFrame(data=df_data, columns=['A','B','C','X','Y'])
df.head()
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='Y'), df[['Y']], test_size=0.3, random_state=42)
# Flatten the single-column targets to one dimension
y_train = np.squeeze(y_train)
y_test = np.squeeze(y_test)
# The practitioner only sees B, X and Y; A and C stay hidden
df_real = df[['B','X','Y']]
X_train_real = X_train[['B','X']]
X_test_real = X_test[['B','X']]
df_real.head()
pp.ProfileReport(df, style={'full_width':True})
pp.ProfileReport(df_real, style={'full_width':True})
px.scatter(data_frame=df, x='X', y='Y')
px.scatter(data_frame=df, x='B', y='Y')
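To quantify what the scatter plots show, here is a quick correlation check on the visible variables (using the df_real defined above):
# Correlation of each visible variable with the target Y
df_real.corr()['Y']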
Let's try a linear regression. Both X and, especially, B are highly correlated with Y, so I will use both variables:
linear_model = LinearRegression()
linear_model.fit(X_train_real, y_train)
predictions_linear_model = linear_model.predict(X_test_real)
print("MAE in test set = {}".format(np.round(mean_absolute_error(predictions_linear_model, y_test), decimals=3)))
It looks like the model has learned the patterns in the data set.
X_train_real.columns
linear_model.coef_
The model says that for each extra unit of X, Y is expected to decrease by about 0.67.
But remember the generating equation:
Y = X + 2*C
So the causal influence of X on Y should be around 1, not -0.67 (!)
Looking at the causal graph, it is clear that, in order to isolate the influence of X on Y, we need to ignore B: B is a collider (a common effect of A and C), and conditioning on it opens the path X ← A → B ← C → Y, creating a spurious confounding association between X and Y.
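To see where the -0.67 comes from, solve the (almost noiseless) generating equations for the hidden variables: X = -3*A gives A = -X/3, and B = 5*A - 2*C gives 2*C = -(5/3)*X - B, so Y = X + 2*C = -(2/3)*X - B. The regression is estimating -2/3 and -1 correctly; they just are not causal effects. As a quick check on the raw generated arrays (using the X, B, Y and n defined above):
# Ordinary least squares of Y on [X, B, 1]; the algebra above predicts
# coefficients close to -2/3 and -1
design = np.column_stack([X, B, np.ones(n)])
coefs, *_ = np.linalg.lstsq(design, Y, rcond=None)
print(coefs[:2].ravel())  # expected: roughly [-0.667, -1.0]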
So let's try the same model, but training only on X:
X_train_causal = X_train[['X']]
X_test_causal = X_test[['X']]
causal_linear_model = LinearRegression()
causal_linear_model.fit(X_train_causal, y_train)
predictions_causal_linear_model = causal_linear_model.predict(X_test_causal)
print("MAE in Test set = {}".format(np.round(mean_absolute_error(predictions_causal_linear_model, y_test), decimals=2)))
The model is considerably worse at predicting Y. This is expected: it only accounts for the influence of X and ignores the hidden variable C, whose term 2*C has a standard deviation of 16.
causal_linear_model.coef_
This time, the model estimates that for each unit increase in X, Y is expected to increase by about 1.05, which is a much better approximation to the real causal effect (1).
Let's take a reference value:
X_test.head(1)
What would happen if I increase X by five units?
# Simulate the intervention do(X = X + 5), leaving all other columns untouched
do_X_test = X_test.copy()
do_X_test[['X']] = do_X_test[['X']] + 5.
do_X_test.head(1)
ML answer:
predictions_ml = linear_model.predict(do_X_test[['B','X']])
Causal inference answer:
predictions_causal_inference = y_test + 5 * 1.05  # observed Y plus 5 times the estimated causal effect
Real answer (computed from the true structural equation, dropping the noise term):
real_answer = do_X_test.X + 2.0 * do_X_test.C
print("MAE ML: {}".format(np.round(mean_absolute_error(real_answer, predictions_ml), decimals=2)))
print("MAE Causal Inference: {}".format(np.round(mean_absolute_error(real_answer, predictions_causal_inference), decimals=2)))
The causal inference method performs much better!
This is not surprising given the coefficients each method estimated.
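As a side note, the 1.05 above was hardcoded from the printed coefficient; the same answer can be obtained by reading the estimated effect directly from the fitted model:
# Same causal-inference answer, taking the effect from the fitted model
estimated_effect = causal_linear_model.coef_[0]
predictions_from_model = y_test + 5 * estimated_effect
print("MAE Causal Inference (model coefficient): {}".format(np.round(mean_absolute_error(real_answer, predictions_from_model), decimals=2)))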
This is a very simple (silly?) experiment, but it shows some important truths:
- Machine learning models (and statistics in general) are very good at predicting when the unseen data is similar to the training data, but they don't necessarily capture the real underlying causal mechanisms.
- If we want to predict what will happen when we actively change a variable, plain ML will not help, because the intervention artificially changes the distribution the model was trained on.
- Causal inference is like an extra dimension that lets us see things we couldn't see before.